FIGURE 5.1: Performance of quantized BERT with varying weight bit-widths and 8-bit activations on MRPC and MNLI-m.

5.2 Fully Quantized Transformer for Machine Translation

Prato et al. introduce FullyQT, an all-inclusive quantization strategy for the Transformer, and are the first to show that a fully quantized Transformer can avoid any loss in translation quality [190]. Their method consists of four parts: the quantization scheme, the choice of which layers to quantize, tensor bucketing, and a dedicated treatment of zeros.

5.2.1 Quantization Scheme

The quantization scheme is uniform, meaning that the step size between two consecutive quantized values is constant. This additional constraint was chosen for practical reasons: it simplifies the computations required during inference and allows hardware resources to be exploited more efficiently. Given an element $x$ of a tensor $X$, the uniform quantization scheme is defined as:

$$Q(x) = \left\lfloor \frac{\operatorname{clamp}(x;\, x_{\min}, x_{\max}) - x_{\min}}{s} \right\rceil, \qquad (5.7)$$

where $x_{\min}$ and $x_{\max}$ define the endpoints of the quantization interval. The clamp function maps all values outside the $[x_{\min}, x_{\max}]$ range to the closest endpoint, and $\lfloor\cdot\rceil$ denotes rounding to the nearest integer.

The step size s is computed by:

$$s = \frac{x_{\max} - x_{\min}}{2^{b} - 1}, \qquad (5.8)$$

where b is simply the bit precision.
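
As a rough illustration of Eqs. (5.7) and (5.8), the following NumPy sketch quantizes a tensor to $b$-bit integer codes and maps the codes back to real values. The function names `uniform_quantize` and `dequantize` are illustrative and not taken from the FullyQT implementation, and the de-quantization step is a standard inverse mapping, not something spelled out in this section.

```python
import numpy as np

def uniform_quantize(x, x_min, x_max, b):
    """Uniform quantization following Eqs. (5.7)-(5.8) (illustrative sketch)."""
    # Step size: s = (x_max - x_min) / (2^b - 1), Eq. (5.8)
    s = (x_max - x_min) / (2 ** b - 1)
    # Clamp to [x_min, x_max], shift by x_min, scale by s,
    # then round to the nearest integer, Eq. (5.7)
    q = np.rint((np.clip(x, x_min, x_max) - x_min) / s)
    return q, s

def dequantize(q, s, x_min):
    """Map integer codes back to approximate real values (inverse of Eq. (5.7))."""
    return q * s + x_min

# Example: 8-bit quantization of a weight tensor, using the tensor's
# own min and max as the interval endpoints (the choice used for weights below).
W = np.random.randn(4, 4).astype(np.float32)
q, s = uniform_quantize(W, W.min(), W.max(), b=8)
W_hat = dequantize(q, s, W.min())
```

The reconstruction `W_hat` differs from `W` only by the rounding error introduced in Eq. (5.7); it is shown here merely to make that error visible.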

When quantization is applied to weights, $x_{\min}$ and $x_{\max}$ are respectively $\min(X)$ and $\max(X)$. However, when quantization is applied to activations, those values are running